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In many important statistical applications, the number of vari- 
ables or parameters p is much larger than the number of observations 
n. Suppose then that we have observations y = Xf3 + z, where /3 G R*" 
is a parameter vector of interest, X is a data matrix with possibly 
far fewer rows than columns, n<^p, and the Zi's are i.i.d. N(0,a^). 
Is it possible to estimate (3 reliably based on the noisy data y? 

To estimate /3, we introduce a new estimator — we call it the Dantzig 
selector — which is a solution to the ^i-regularization problem 



where r is the residual vector y — Xp and f is a positive scalar. We 
show that if X obeys a uniform uncertainty principle (with unit- 
normed columns) and if the true parameter vector (3 is sufficiently 
sparse (which here roughly guarantees that the model is identifiable), 
then with very large probability. 



Our results are nonasymptotic and we give values for the constant C. 
Even though n may be much smaller than p, our estimator achieves 
a loss within a logarithmic factor of the ideal mean squared error one 
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min ||/3|[«i subjectto \\X*r\\t^<{l + t ^)^21ogp-cr, 
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would achieve with an oracle which would supply perfect information 
about which coordinates are nonzero, and which were above the noise 
level. 

In multivariate regression and from a model selection viewpoint, 
our result says that it is possible nearly to select the best subset of 
variables by solving a very simple convex program, which, in fact, 
can easily be recast as a convenient linear program (LP). 



1. Introduction. In many important statistical applications, the number 
of variables or parameters p is now much larger than the number of observa- 
tions n. In radiology and biomedical imaging, for instance, one is typically 
able to collect far fewer measurements about an image of interest than the 
unknown number of pixels. Examples in functional MRI and tomography 
all come to mind. High dimensional data frequently arise in genomics. Gene 
expression studies are a typical example: a relatively low number of obser- 
vations (in the tens) is available, while the total number of genes assayed 
(and considered as possible regressors) is easily in the thousands. Other ex- 
amples in statistical signal processing and nonparametric estimation include 
the recovery of a continuous-time curve or surface from a finite number of 
noisy samples. Estimation in this setting is generally acknowledged as an 
important challenge in contemporary statistics; see the recent conference 
held in Leiden, The Netherlands (September 2002), "On high-dimensional 
data p ^ n in mathematical statistics and bio-medical applications." It is 
believed that progress may have the potential for impact across many areas 
of statistics [30] . 

In many research fields then, scientists work with data matrices with many 
variables p and comparably few observations n. This paper is about this 
important situation, and considers the problem of estimating a parameter 
f3 G from the linear model 



y G is a vector of observations, X is an n x p predictor matrix, and z 
a vector of stochastic measurement errors. Unless specified otherwise, we 
will assume that z ~ N{0,a'^ln) is a vector of independent normal random 
variables, although it is clear that our methods and results may be extended 
to other distributions. Throughout this paper, we will of course typically 
assume that p is much larger than n. 

1.1. Uniform uncertainty principles and the noiseless case. At first, re- 
liably estimating /3 from y may seem impossible. Even in the noiseless case, 
one may wonder how one could possibly do this, as one would need to 
solve an underdetermined system of equations with fewer equations than 
unknowns. But suppose now that /3 is known to be structured in the sense 




y = Xp + z; 
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that it is sparse or compressible. For example, suppose that [3 is S'-sparse 
so that only S of its entries are nonzero. This premise radically changes the 
problem, making the search for solutions feasible. In fact, [13] showed that 
in the noiseless case, one could actually recover (3 exactly by solving the 
convex program := Ef=i IAD 

(1.2) (Pi) min subject to X/3 = y, 

/3gRp 

provided that the matrix X G R"^p obeys a uniform uncertainty principle. 
(The program (Pi) can even be recast as a linear program.) That is, ^l- 
minimization finds without error both the location and amplitudes — which 
we emphasize are a priori completely unknown — of the nonzero components 
of the vector (3 € R*'. We also refer the reader to [20, 24] and [27] for inspiring 
early results. 

To understand the exact recovery phenomenon, we introduce the notion 
of uniform uncertainty principle (UUP) proposed in [14] and refined in [13]. 
This principle will play an important role throughout, although we empha- 
size that this paper is not about the exact recovery of noiseless data. The 
UUP essentially states that the nxp measurement or design matrix X obeys 
a "restricted isometry hypothesis." Let Xt, T C {1, . . . be the n x |T| 
submatrix obtained by extracting the columns of X corresponding to the 
indices in T; then [13] defines the S'-restricted isometry constant 5s of X 
which is the smallest quantity such that 

(1-3) {1-6s)\\c\\1<\\Xtc\\1<{1 + 5s)\\c\\1 

for all subsets T with |T| < S and coefficient sequences {cj)j^T- This prop- 
erty essentially requires that every set of columns with cardinality less than 
S approximately behaves like an orthonormal system. It was shown (also in 
[13]) that if S obeys 

(1.4) 55 + '52S + 535 < 1, 

then solving (Pi) recovers any sparse signal j3 with support size obeying 
|T| <5. 

Actually, [13] derived a slightly stronger result. Introduce the 5, S'-restricted 
orthogonality constants 6s, s' for 5-|-5' < p to be the smallest quantities such 
that 

(1-5) \{Xtc,Xt'c')\<9s,s'-\\c\\i,\\c'\U, 

holds for all disjoint sets T, T' C {1, . . . ,p} of cardinality |T| < 5 and |P'| < 
S'. Small values of restricted orthogonality constants indicate that disjoint 
subsets of covariates span nearly orthogonal subspaces. Then the authors 
showed that the recovery is exact, provided 

(1.6) 6s + es,s + 0s,2S<l, 

which is a little better since it is not hard to see that 6s+s' ~ ^S' ^ ^5,5' ^ 
6s+s' for S' > S ([13], Lemma 1.1). 
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1.2. Uniform uncertainty principles and statistical estimation. Any real- 
world sensor or measurement device is subject to at least a small amount 
of noise. And now one asks whether it is possible to reliably estimate the 
parameter (3 € from the noisy data y G R" and the model (1.1). Frankly, 
this may seem like an impossible task. How can one hope to estimate /3, 
when, in addition to having too few observations, these are also contami- 
nated with noise? 

To estimate (3 with noisy data, we consider, nevertheless, solving the 
convex program 

(1.7) (DS) min subject to \\X*r\U^ := sup \iX*r)i\ < Xp ■ a 

peRp i<«<p 

for some Ap > 0, where r is the vector of residuals 

(1.8) r = y-Xp. 

In other words, we seek an estimator /5 with minimum complexity (as mea- 
sured by the ^i-norm) among all objects that are consistent with the data. 
The constraint on the residual vector imposes that for each i G {1, . . . 
I {X*r)i\ < Xp-a, and guarantees that the residuals are within the noise level. 
As we shall see later, this proposal makes sense provided that the columns 
of X have the same Euclidean size and in this paper we will always assume 
they are unit-normed; our results would equally apply to matrices with dif- 
ferent column sizes — one would only need to change the right-hand side to 
|(X*r)j| less or equal to Ap • fj times the Euclidean norm of the ith column 
of X, or to |(X*r)j| < + 5i • Ap • a since all the columns have norm less 
than ^/l -\- 6i. 

There are many reasons why one would want to constrain the size of 
the correlated residual vector X*r rather than the size of the residual vec- 
tor r. Suppose that an orthonormal transformation is applied to the data, 
giving y' = Uy, where U*U is the identity. Clearly, a good estimation pro- 
cedure for estimating /? should not depend upon U (after all, one could 
apply U* to return to the original problem). It turns out that the esti- 
mation procedure (1.7) is actually invariant with respect to orthonormal 
transformations applied to the data vector since the feasible region is in- 
variant: {UXy{UX(5 — Uy) = X*{XP — y). In contrast, had we defined the 
feasibility region with supj |rj| being smaller than a fixed threshold, then 
the estimation procedure would not be invariant. There are other reasons 
aside from this. One of them is that we would obviously want to include in 
the model explanatory variables that are highly correlated with the data y. 
Consider the situation in which a residual vector is equal to a column X* of 
the design matrix X. Suppose, for simplicity, that the components of X* all 
have about the same size, that is, about l/\/n, and assume that a is slightly 
larger than l/i/n. Had we used a constraint of the form supj < A„cj (with 
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perhaps A„ of size about \/2 logn), the vector of residuals would be feasible, 
which does not make any sense. In contrast, such a residual vector would 
not be feasible for (1.7) for reasonable values of the noise level, and the ith 
variable would be rightly included in the model. 

Again, the program (DS) is convex, and can easily be recast as a linear 
program (LP), 

min Uj subject to — u < P < u and 

(1-9) 

-Xpal <X*{y- X(3) < Xpal, 

where the optimization variables are u,f3 £ KP ^ and 1 is a p-dimensional 
vector of ones. Hence, our estimation procedure is computationally tractable; 
see Section 4.4 for details. There is indeed a growing family of ever more 
efficient algorithms for solving such problems (even for problems with tens 
or even hundreds of thousands of observations) [8]. 

We call the estimator (1.7) the Dantzig selector; with this name, we intend 
to pay tribute to the father of linear programming who passed away while we 
were finalizing this manuscript, and to underscore that the convex program 
{DS) is effectively a variable selection technique. 

The first result of this paper is that the Dantzig selector is surprisingly 
accurate. 

Theorem 1.1. Suppose (3 ^YiP is any S -sparse vector of parameters 
obeying 623 + &S,2S < 1- Choose Xp = \/21ogp in (1.7). Then with large prob- 
ability, obeys 

(1.10) \\P-P\\l<C^.{2logp)-S-a\ 

with Ci = 4/(1 — — 9s,2s)- Hence, for small values of 5s + 0s,2S, Ci ^ 4. 
For concreteness, if one chooses Xp := \/2{l + a) logp for each a > 0, the 
bound holds with probability exceeding 1 — (-v/vrlogp • p"')~^ with the proviso 
that Xp substitutes 21ogp in (1.10). 

We will discuss the condition 62s + ^s,2S < 1 later but, for the moment, 
observe that (1.10) describes a striking phenomenon: not only are we able to 
reliably estimate the vector of parameters from limited observations, but the 
mean squared error is simply proportional — up to a logarithmic factor — to 
the true number of unknowns times the noise level cr^. What is of interest 
here is that one can achieve this feat by solving a simple linear program. 
Moreover, and ignoring the log-like factor, statistical common sense tells us 
that (1.10) is, in general, unimprovable. 

To see why this is true, suppose one had available an oracle letting us know 
in advance the location of the S nonzero entries of the parameter vector. 
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that is, To := {i:(3i^ 0}. That is, in the language of model selection, one 
would know the right model ahead of time. Then one could use this valuable 
information and construct an ideal estimator P* by using the least-squares 
projection 

where is the restriction of /?* to the set Tq, and set (5* to zero outside 
of Tq. (At times, we will abuse notation and also let /3/ be the truncated 
vector equal to Pi for i G / and zero otherwise.) Clearly, 



p*=p+{xixT,r'xiz 



and 

E||r - PWl = E|| {x'i:^XT,)-'x'i:^z\\l = a' TT{{xiXT,r 

Now since all the eigenvalues of Xj,^^ Xtq belong to the interval [1 — iJ^ , 1 + , 
the ideal expected mean squared error would obey 

nw-p\\l>-^^-s-a\ 

Hence, Theorem 1.1 says that the minimum (.1 estimator achieves a loss 
within a logarithmic factor of the ideal mean squared error; the logarithmic 
factor is the price we pay for adaptivity, that is, for not knowing ahead of 
time where the nonzero parameter values actually are. 

In short, the recovery procedure, although extremely nonlinear, is stable 
in the presence of noise. This is especially interesting because the matrix X 
in (1.1) is rectangular; it has many more columns than rows. As such, most 
of its singular values are zero. In solving (DS), we are essentially trying to 
invert the action of X on our hidden /3 in the presence of noise. The fact 
that this matrix inversion process keeps the perturbation from "blowing 
up" — even though it is severely ill-posed — is perhaps unexpected. 

Presumably, our result would be especially interesting if one could esti- 
mate the order of n parameters with as few as n observations. That is, we 
would like the condition + ^5,25 < 1 to hold for very large values of 5, for 
example, as close as possible to n (note that for 25 > n, S2S > 1 since any 
submatrix with more than n columns must be singular, which implies that 
in any event S must be less than n/2). Now, this paper is part of a larger 
body of work [11, 13, 14] which shows that, for "generic" or random design 
matrices X , the condition holds for very significant values of S. Suppose, for 
instance, that X is a random matrix with i.i.d. Gaussian entries. Then with 
overwhelming probability, the condition holds for S = 0{n/\og{p/n)). In 
other words, this setup only requires 0(log(p/n)) observations per nonzero 
parameter value; for example, when n is a nonnegligible fraction of p, one 
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only needs a handful of observations per nonzero coefficient. In practice, this 
number is quite small, as few as 5 or 6 observations per unknown generally 
suffice (over a large range of the ratio p/n); see Section 4. Many design 
matrices have a similar behavior and Section 2 discusses a few of these. 

As an aside, it is interesting to note that, for S obeying the condition 
of the theorem, the reconstruction from noiseless data {a = 0) is exact and 
that our condition is slightly better than (1.6). 

1.3. Oracle inequalities. Theorem 1.1 is certainly noticeable but there 
are instances, however, in which it may still be a little naive. Suppose, for 
example, that [3 is very small so that j3 is well below the noise level, that 
is, \f3i \ <C cr for all i. Then with this information we could set /3 = 0, and the 
squared error loss would then simply be X]?=i lAPj which may potentially 
be much smaller than cj^ times the number of nonzero coordinates of /3. In 
some sense, this is a situation in which the squared bias is much smaller 
than the variance. 

A more ambitious proposal might then ask for a near-optimal trade-off 
coordinate by coordinate. To explain this idea, suppose, for simplicity, that 
X is the identity matrix so that y ~ N{P, a'^Ip). Suppose then that we had 
available an oracle letting us know ahead of time which coordinates of /? are 
significant, that is, the set of indices for which \(3i\ > a. Then equipped with 
this oracle, we would set [3* = yi for each index in the significant set and 
P* = otherwise. The expected mean squared error of this ideal estimator 
is then 

(1.11) B\\P*-P\\l=j2^m{(3la'). 

1=1 

Here and below, we will refer to (1-11) as the ideal MSE. As is well known, 
thresholding rules with threshold level at about ^/2\ogp ■ a achieve the ideal 
MSE to within a multiplicative factor proportional to logp [21, 22]. 

In the context of the linear model, we might think about the ideal esti- 
mation as follows: consider the least-squares estimator /3/ = {Xj Xj)~^Xfy 
as before and consider the ideal least-squares estimator /3* which minimizes 
the expected mean squared error 

r = argmin Ell/3 

/C{l,...,p} 

In other words, one would fit all least-squares models and rely on an oracle 
to tell us which model to choose. This is ideal because we can of course not 
evaluate E||/3 — /J/Hl^ since we do not know f] (we are trying to estimate it 
after all). But we can view this as a benchmark and ask whether any real 
estimator would obey 

(1-12) 11/3 -/3|||, = 0(logp)- Ell/3 -/3*|||^ 
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with large probability. 

In some sense, (1.11) is a proxy for the ideal risk E||/3 — /^^Hf^- Indeed, 
let / be a fixed subset of indices and consider regressing y onto this subset 
(we again denote by (3i the restriction of (3 to the set /). The error of this 
estimator is given by 

Wi-p\\l = m-mi + Wi-fi\\l- 

The first term is equal to 

/3/ - /?/ = {XjXj)-^XfX(3ic + {XjXiy^Xjz, 
and its expected mean squared error is given by the formula 

nPi-Pif = \\{XjXj)-'xfX(3j4l+a^TTi{XjXjr'). 
Thus, this term obeys 

B\\i3i-Pif>-^—-\I\-a^ 

1 + d\j\ 

for the same reasons as before. In short, for all sets /, |/| < S, with 5s < 1, 
say, 

E||/3/-/?f >i-fE/3' + m-^'l 

Vie/': / 

which gives that the ideal mean squared error is bounded below by 
E||r - > I ■ ( E (3f + \I\-A=^-Y^ min(A^ a^). 

Kiel'' / i 

In that sense, the ideal risk is lower bounded by the proxy (1.12). As we 
have seen, the proxy is meaningful since it has a natural interpretation in 
terms of the ideal squared bias and variance, 

5]min(/?2,a2)= min - + |/| . 

^ 7C{l,...,p} 

This raises a fundamental question: given data y and the linear model (1.1), 
not knowing anything about the significant coordinates of (3 and not being 
able to observe directly the parameter values, can we design an estimator 
which nearly achieves (1.12)? Our main result is that the Dantzig selector 
(1.7) does just that. 

Theorem 1.2. Choose t > and set \p := (1 + t~^)^/2logp in (1.7). 
Then if [3 is S-sparse with 62s + ^5,25 < 1 — t, our estimator obeys 

(1.13) 11/3 - < Cl -Xl-l^a^ + j^ min(/32, ^2)^ 
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with large probability [the probability is as before for Xp := (\/l + a + t ^) x 
\/2 logp]. Here, C2 may only depend on 62s and ^5,25; see below. 

We emphasize that (1.13) is nonasymptotic and our analysis actually 
yields explicit constants. For instance, we also prove that 



i-5-e {i-d-ey i-5-e 

and 

where above and below, we put 5 := 62s and 9 := ^5,25 for convenience. For 
6 and small, C2 is close to 

C2 ~ 2(4^/2 + 1 + I/V2) + 1 < 16. 

The condition imposing 62s + ^5,25 < 1 (or less than l — t) has a rather natu- 
ral interpretation in terms of model identifiability. Consider a rank deficient 
submatrix XtuT' with 2S columns (lowest eigenvalue is = 1 — 623), and 
with indices in T and T' , each of size S. Then there is a vector h obeying 
Xh = and which can be decomposed as h = (3 — (5' , where f3 is supported 
on T and likewise for /?'; that is, 

In short, this says that the model is not identifiable since both (3 and (3' are 
S-sparse. In other words, we need ^25 < 1- The requirement ^25 + ^s,2S < 1 
(or less than l — t) is only slightly stronger than the identifiability condition, 
roughly two times stronger. It puts a lower bound on the singular values 
of submatrices and, in effect, prevents situations where multicollinearity 
between competitive subsets of predictors could occur. 

1.4. Ideal model selection by linear programming. Our estimation pro- 
cedure is of course an implicit method for choosing a desirable subset of 
predictors, based on the noisy data y = Xfi + z, from among all subsets of 
variables. As the reader will see, there is nothing in our arguments that re- 
quires p to be larger than n and, thus, the Dantzig selector can be construed 
as a very general variable selection strategy — hence, the name. 

There is of course a huge literature on model selection, and many pro- 
cedures motivated by a wide array of criteria have been proposed over the 
years — among which [1, 7, 26, 31, 36]. By and large, the most commonly dis- 
cussed approach — the "canonical selection procedure" according to [26] — is 
defined as 



(1.15) argmin||y-X/3||| +A.c72.||/3||,„, := |{i : A / 0}|, 



«2 ^ " • ''^ 

/3GRP 
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which best trades-off between the goodness of fit and the complexity of the 
model, the so-called bias and variance terms. Popular selection procedures 
such as AIC, Cp, BIC and RIC are all of this form with different values of 
the parameter A; see also [2, 4, 5, 6, 7] for related proposals. To make a 
long story short, model selection is an important area in part because of 
the thousands of people routinely fitting large linear models or designing 
statistical experiments. As such, it has and still receives a lot of attention, 
and progress in this field is likely to have a large impact. Now despite the 
size of the current literature, we believe there are two critical problems in 
this field: 

• First, finding the minimum of (1.15) is in general iVP-hard [32]. To the 
best of our knowledge, solving this problem essentially requires exhaustive 
searches over all subsets of columns of X, a procedure which clearly is 
combinatorial in nature and has exponential complexity since for p of size 
about n, there are about 2^ such subsets. (We are of course aware that in 
a few exceptional circumstances, e.g., when X is an orthonormal matrix, 
the solution is computationally feasible and given by thresholding rules 
[7, 21].) 

In other words, solving the model selection problem might be possible 
only when p ranges in the few dozens. This is especially problematic when 
one considers that we now live in a "data-driven" era marked by ever larger 
datasets. 

• Second, estimating (3 and XP — especially when p is larger than n — are two 
very different problems. Whereas there is an extensive literature about the 
problem of estimating Xf3, the quantitative literature about the equally 
important problem of estimating (3 in the modern setup where p is not 
small compared to n is scarce; see [25]. For completeness, important and 
beautiful results about the former problem (estimating XP) include the 
papers [3, 4, 6, 7, 23, 26]. 

In recent years, researchers have of course developed alternatives to overcome 
these computational difficulties, and we would like to single out the popular 
lasso also known as Basis Pursuit [15, 38], which relaxes the counting norm 
WPWeo into the convex ^i-norm U/SJl^i . Notwithstanding the novel and exciting 
work of [28] on the persistence of the lasso for variable selection in high 
dimensions — which again is about estimating XP and not /3 — not much is 
yet known about the performance of such strategies although they seem to 
work well in practice; for example, see [35] . 

Against this background, our work clearly marks a significant departure 
from the current literature, both in terms of what it achieves and of its 
methods. Indeed, our paper introduces a method for selecting variables based 
on linear programming, and obtains decisive quantitative results in fairly 
general settings. 
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1.5. Extension to nearly sparse parameters. We have considered thus 
far the estimation of sparse parameter vectors, that is, with a number S of 
nonzero entries obeying ^25 + 6*5,25- We already explained that this condi- 
tion is in some sense necessary as otherwise one might have an "aliasing" 
problem, a situation in which Xp ~ XP' , although /3 and /?' might be com- 
pletely different. However, extensions of our results to nonsparse objects are 
possible provided that one imposes other types of constraints to remove the 
possibility of strong aliasing. 

Many such constraints may exist and we consider one of them which 
imposes some decay condition on the entries of f3. Rearrange the entries of 
P by decreasing order of magnitude |/3(i)| > |/3(2)| > • • • > and suppose 
the kth largest entry obeys 

(1.16) |/3(fc)|<i^•A:-^/^ 

for some positive R and s < 1, say. Can we show that our estimator achieves 
an error close to the proxy (1.11)? The first observation is that, to mimic 
this proxy, we need to be able to estimate reliably all the coordinates which 
are significantly above the noise level, that is, roughly such that > a. 
Let S = \{i: > a}\. Then if ^25 + ^s,2S < this might be possible, but 
otherwise, we may simply not have enough observations to estimate that 
many coefficients. The second observation is that for P E obeying (1.16), 

(1.17) J2^miPla^) = S-a^+ ^ \P^,)\' < C ■ {S ■ + R'S-^n 

i i>S+l 

with r = l/s — 1/2. With this in mind, we have the following result. 

Theorem 1.3. Suppose P G R*^ obeys (1.16) and let 5* be fixed such 
that (^25* +^5,. 25* < 1- Choose Xp as in Theorem 1.1. Then P obeys 

(1.18) \\P - PWl < min Cs ■ 2 logp ■ {S ■ + R^S-'n 
with large probability. 

Note that for each P obeying (1.16), \{i:\Pi\ > a}\ < (/^/c^)^/^ Then if 
5"* > (R/a)^/^, it is not hard to see that (1.18) becomes 

(1.19) \\P - PWl < O(logp) • . (^2)2./(2r+l)^ 

which is the well-known minimax rate for classes of objects exhibiting the 
decay (1.16). Even though we have n <Cp, the Dantzig selector recovers the 
minimax rate that one would get if we were able to measure all the coordi- 
nates of P directly via y ~ N{P,a'^Ip). In the case where 5* < (R/a)'^/', the 
method saturates because we do not have enough data to recover the mini- 
max rate, and can only guarantee a squared loss of about 0{logp){R'^ S^'^^ + 
S■^, • (7^). Note, however, that the error is well controlled. 
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1.6. Variations and other extensions. When X is an orthogonal matrix, 
the Dantzig selector (5 is then the ^i-minimizer subject to the constraint 
\\X*y — < Ap • (T. This implies that (3 is simply the soft-thresholded 
version of X*y at level Ap • fi; thus, 

Pi = max(|(X*y),| -\p-a, 0) sgn((X*y),). 

In other words, X*y is shifted toward the origin. In Section 4 we will see 
that for arbitrary X^s the method continues to exhibit a soft-thresholding 
type of behavior and as a result, may slightly underestimate the true value 
of the nonzero parameters. 

There are several simple methods which can correct for this bias and 
increase performance in practical settings. We consider one of these based 
on a two-stage procedure: 

1. Estimate / = {i : /3j 7^ 0} with I = {i : Pi ^ 0} with /3 as in (1.7) (or, more 
generally, with I = {i : \Pi\ > a ■ a} for some a > 0). 

2. Construct the estimator 

(1.20) /?; = {XjXjr'xJy, 

and set the other coordinates to zero. 

Hence, we rely on the Dantzig selector to estimate the model /, and construct 
a new estimator by regressing the data y onto the model /. We will refer 
to this variation as the Gauss-Dantzig selector. As we will see in Section 4, 
this recenters the estimate and generally yields higher statistical accuracy. 
We anticipate that all our theorems hold with some such variations. 

Although we prove our main results in the case where z is a vector of 
i.i.d. Gaussian variables, our methods and results would certainly extend to 
other noise distributions. The key is to constrain the residuals so that the 
true vector /3 is feasible for the optimization problem. In details, this means 
that we need to set Ap so that Z* = supj | (X* ,z)\ is less than ApO" with large 
probability. When z ~ A^(0, a'^In), this is achieved for Ap = \/2Togp, but one 
could compute other thresholds for other types of zero-mean distributions 
and derive results similar to those introduced in this paper. In general setups, 
one would perhaps want to be more flexible and have thresholds depending 
upon the column index. For example, one could declare that r is feasible if 
supj |(X',r)|/Ap is below some fixed threshold. 

1.7. Organization of the paper. The paper is organized as follows. We 
begin by discussing the implications of this work for experimental design in 
Section 2. We prove our main results, namely. Theorems 1.1, 1.2 and 1.3, 
in Section 3. Section 4 introduces numerical experiments showing that our 
approach is effective in practical applications. Finally, Section 5 closes the 
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paper with a short summary of our findings and of their consequences for 
model selection, and with a discussion of other related work in Section 5.2. 
Finally, the Appendix provides proofs of key lemmas supporting the proof 
of Theorem 1.2. 

2. Significance for experimental design. Before we begin proving our 
main results, we would like to explain why our method might be of interest 
to anyone seeking to measure or sense a sparse high-dimensional vector us- 
ing as few measurements as possible. In the noiseless case, our earlier results 
showed that if (3 is S'-sparse, then it can be reconstructed exactly from n 
measurements y = XP, provided that 6 + 6 < 1 [11, 13]. These were later 
extended to include wider classes of objects, that is, the so called compress- 
ible objects. Against this background, our results show that the Dantzig 
selector is robust against measurement errors (no realistic measuring device 
can provide infinite precision), thereby making it well suited for practical 
applications. 

2.1. Random matrices and designs. An interesting aspect of this theory 
is that random matrices X are in some sense ideal for recovering an object 
from a few projections. For example, if X is a properly normalized Gaussian 
matrix with i.i.d. entries, then the conditions of our theorems hold with 

(2.1) 5xn/log(p/n) 

with overwhelming probability [13, 14, 18, 37]. The same relation is also 
conjectured to be true for other types of random matrices such as normalized 
binary arrays with i.i.d. entries taking values ±1 with probability 1/2. Other 
interesting strategies for recovering a sparse signal in the time domain might 
be to sense a comparatively small number of its Fourier coefficients. In fact, 
[14] show that in this case our main condition holds with 

S >in/ log^p, 

for nearly all subsets of observed coefficients of size n. Vershynin has in- 
formed us that S X n/log^ p also holds, and we believe that S x n/logp is 
also true. More generally, suppose that X is obtained by randomly sampling 
n rows of a p by p orthonormal matrix U (and renormalizing the columns 
so that they are unit-normed). Then we can take 

5xn/[/i2 logSp], 

with /i the coherence fi := supj^ \/n\Uij\ [14]. 

Of course, all these calculations have implications for random designs. 
For example, suppose that in an idealized application one could — in a first 
experiment — observe (3 directly and measure y^^^ ~ N{f3,a^Ip). Consider 
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a second experiment where one measures instead y ~ N {X (3 , a'^ In) , where 
X is a renormahzed random design matrix with i.i.d. entries taking values 
±1 with probabihty 1/2. Suppose that the signal is S-sparse (note that 
X ||/3||^2)- Then reversing (2.1), we see that with about 

n X 5" • log(p/S') 

observations, one would get just about the same mean squared error that one 
would achieve by measuring all the coordinates of f3 directly (and applying 
thresholding) . 

Such procedures are not foreign to statisticians. Combining parameters 
by random design or otherwise goes back a long way; see, for example, the 
long history of blood pooling strategies. The theoretical analysis needs of 
course to be validated with numerical simulations, which may give further 
insights about the practical behavior of our methods. Section 4 presents a 
first series of experiments to complement our study. 

2.2. Applications. The ability to recover a sparse or nearly sparse pa- 
rameter vector from a few observations raises tantalizing opportunities and 
we mention just a few to give concrete ideas: 

1. Biomedical imaging. In the field of biomedical imaging, one is often only 
able to collect far fewer measurements than the number of pixels. In mag- 
netic resonance imaging (MRI), for instance, one would like to reconstruct 
high-resolution images from heavily undersampled frequency data, as this 
would allow image acquisition speeds far beyond those offered by current 
technologies; for example, see [33] and [16] . If the image is sparse, as is 
the case in magnetic resonance angiography (or if its gradient is sparse 
or, more generally, if the image is sparse in a fixed basis [9]), then ii- 
minimization may have a chance to be very effective in such challenging 
settings. 

2. Analog to digital. By making a number n of general linear measurements 
rather than measuring the usual pixels, one could, in principle, recon- 
struct a compressible or sparse image with essentially the same resolu- 
tion as one would obtain by measuring all the pixels. Now suppose one 
could design analog sensors able to make measurements by correlating 
the signal we wish to acquire against incoherent waveforms as discussed 
in the previous sections. Then one would effectively be able to make up 
a digital image with far fewer sensors than what is usually considered 
necessary [14, 19]. 

3. Sensor networks. There are promising applications in sensor networks 
where taking random projections may yield the same distortion (the same 
quality of reconstruction) , but using much less power than what is tradi- 
tionally required [29]. 
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3. Proof of theorems. We now prove Theorems 1.1, 1.2 and 1.3, and we 
introduce some notation that we wih use throughout this section. We let 
X^, . . . G be the p columns of X (the exploratory variables) so that 
Xf3 = (3iX^ + ■■■ + ppXP and {X*y)j = {y, X^),l<j<p. We recall that the 
columns of X are normalized to have unit norm, that is, HX-'H^^ = 1- 

Note that it is sufficient to prove our theorems with £7 = 1, as the general 
case would follow from a simple rescaling argument. Therefore, we assume 
(7 = 1 from now on. Now a key observation is that, with large probability, 
z N{0, In) obeys the orthogonality condition 

(3.1) < Ap for all l<j<p, 

for Xp = y/2\ogp. This is standard and simply follows from the fact that, 
for each j, Zj := {z,X^) ~ iV(0,l). We will see that if (3.1) holds, then 
(1.10) holds. Note that, for each u> 0, P(supj \Zj\ > u) <2p- (j){u)/u, where 

4){u) := (27r)~"^/^e~"^/^, and our quantitative probabilistic statement just 
follows from this bound. Better bounds are possible, but we will not pursue 
these refinements here. As remarked earlier, if the columns were not unit 
normed, one would obtain the same conclusion with Xp = + 6i ■ \/2logp 
since HX-^H^^ ^ vT-j-Tj". 

3.1. High- dimensional geometry. It is probably best to start by intro- 
ducing intuitive arguments underlying Theorems 1.1 and 1.2. These ideas 
are very geometrical and we hope they will convey the gist of the proof. 

Consider Theorem 1.1 first, and suppose that y = X(3 + z, where z obeys 
the orthogonality condition (3.1) for some Xp. Let (3 be the minimizer of 
(1.7). Clearly, the true vector of parameters (3 is feasible and, hence, 

mu.<\mw- 

Decompose f3 as P = (3 + h and let To be the support of (3, Tq = {i: I3i ^ 0}. 
Then h obeys two geometric constraints: 

1. First, as essentially observed in [20], 

where again the ith component of the vector hj-^ is that of /i if i G Tq and 
zero otherwise (similarly for hx^). Hence, h obeys the cone constraint 

(3.2) \\hT^\U,<\\hT,\U,. 

2. Second, since 

{z -r,X^) = {XP- Xp, X^) = {Xh, X^), 
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(a) (b) 



Fig. 1. This figure represents the geometry of the constraints. On the left, the shaded 
area represents the set of h obeying both (3.2) (hourglass region) and (3.3) (slab region). 
The right figure represents the situation in the more general case. 

it follows from the triangle inequality that 
(3.3) \\X*Xh\\e^<2Xp. 

We will see that these two geometrical constraints imply that h is small in 
the ^2-iiorm. In other words, we will show that 

sup subject to ll/irdki < ll/iToIki and \\X*Xh\\i^ <2Xp 

heRp 

(3.4) 

obeys the desired bound, that is, 0{Xp • |To|). This is illustrated in Fig- 
ure 1(a). Hence, our statistical question is deeply connected with the geom- 
etry of high-dimensional Banach spaces, and that of high-dimensional spaces 
in general. 

To think about the general case, consider the set of indices Tq := {i : > 
a} and let Ptq be the vector equal to (3 on Tq and zero outside, /? = Ptq + Pxg ■ 
Suppose now that Ptq were feasible. Then we would have \\P\\ei < ll/?Tolki; 
writing /3 = Ptq + h, the same analysis as that of Theorem 1.1 — and outlined 
above — would give 

\\P-PTo\\l = 0{logp)-\To\-a\ 
From 11/3 - < 2||/3 - /3to \\l + 2||/3 - Pto one would get 
\\P-f3\\l = 0{logp)-\To\-a^ + 2 ^ pf, 

i: |ft|<cr 

which is the content of (1.13). Unfortunately, while Ptq may be feasible for 
"most" S'-sparse vectors /3, it is not for some, and the argument is consid- 
erably more involved. 
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3.2. Proof of Theorem 1.1. 

Lemma 3.1. Suppose To is a set of cardinality S with 5 + 9 < 1. For a 
vector h G , we let Ti be the S largest positions of h outside of Tq . Put 
Tqi = Tq U Ti . Then 

1 9 

ll^ll^2{Toi) < Y3^II^Tbi^^ll^2 + (1_ 5)51/2 11^11^1(7^0^) 

and 

Proof. Consider the restricted transformation Xxg^ : R^°i — > R", Xt^^c := 
J2j£Toi ^jX'' ■ Let V C R"" be the span of {X^ : j € Tqi}. Then V is of course 
the range of Xtq^ and also the orthogonal complement of the kernel of Xj,^^^ , 
which says that R" is the orthogonal sum V © V'^ . Because 5 < 1 , we know 
that the operator Xtqi is a bijection from R"^"^ to V, with singular values 
between \/l — 6 and \/l + 6. As a consequence, for any c G ^2 (Tqi), we have 

Vl - S\\c\\e^ < ||Xtoic||£2 < Vl + S\\c\\e^. 

Moreover, letting Py denote the orthogonal projection onto V, we have for 
each w G R", X^ w = X^ Pvw and it follows that 

(3.5) VT^\\Pvw\\e, < \\X^,,w\\e, < VTT6\\Pvw\\e,. 
We apply this to w := Xh and conclude, in particular, that 

(3.6) \\PvXh\\,, < (1 - 5r^'^Xl^^Xh\\,,. 

The next step is to derive a lower bound on PyXh. To do this, we begin by 
dividing Tq into subsets of size S and enumerate Tq as ni,n2, ■ ■ ■ ,np_\Xg\ in 
decreasing order of magnitude of /it=- Set Tj = {ni, (j — 1)S + 1 < £ < jS}. 
That is, Ti is as before and contains the indices of the S largest coefficients 
of hTg, T2 contains the indices of the next S largest coefficients, and so on. 
Decompose now PyXh as 

(3.7) PyXh = PyXhTo, + J2 PvXhT^ . 

i>2 

By definition, Xh^gi G V and PyXh^gi = Xhxoi ■ Further, since Py is an 
orthogonal projection onto the span of the X-^'s for j G Tqi, PyXhxj = 
X^jeToi '^jX-^ for some coefficients Cj, and the following identity holds: 



(3.8) 



PyXhT, \\l = {PyXhr, , XhT^ ) . 
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By restricted orthogonality followed by restricted isometry, this gives 

1/2 

11^11^2 



VieToi / 



Q 

which upon combining with (3.8) gives 

(3.9) \\PvXhT,\U,<^^^\\hT,\\e,. 

We then develop an upper bound on J2j>2 W^Tj life ^ ™ [12]. By construction, 
the magnitude of each component /iTj+i [i] of /iTj+i is less than the average 
of the magnitudes of the components of Htj ■ 



|/iT,+,WI<|IHIk/s. 

Then ||/irj_|_i life — II^Tjllfe/5' and, therefore, 

(3.10) EllKH^2<'5"'/'EllKlk=^"'^'ll^ll^i(T^)- 
i>2 j>i 



To summarize, Xhroi obeys HX/itoi ||fe > \/l — 5 ||/itui life by restricted isom- 
etry, and since E,>2 WPvXhr^ ||,, < 9(1 - 6y'/^ S-y^\\h\\e,(^TS), 

\\PvXh\u, > VT^\\h\u^^To,) - ^^^s-'/^h\u^^T§). 

Combining this with (3.6) proves the first part of the lemma. 

It remains to argue the second part. Observe that the kth largest value 
of hx^ obeys 

\hT-\{k) < II^T„Hlfe/^ 

and, therefore, 

ii/iT„=jii < ii/iT„Hife E yk^<\\hTs\\l/s, 

k>S+l 

which is what we needed to establish. The lemma is proven. □ 

Theorem 1.1 is now an easy consequence of this lemma. Observe that on 
the one hand, (3.2) gives 

||/lroHlfe < II^Tollfe < 5'^''^ II ^To life, 
while on the other hand, (3.3) gives 

\\Xl^Xh\U,<{2S)'/'-2Xp, 
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since each of the 2S coefficients of Xj^^^Xh is at most 2Ap (3.3). In conclusion, 
we apply Lemma 3.1 and obtain 

ll^llfe(Toi) < i_^_Q • ^25- 2Ap. 
The theorem follows since WhWf^ < 2\\h\\'j^^j,^_^y 

3.3. Proof of Theorem 1.3. The argument underlying Theorem 1.3 is 
almost the same as that of Theorem 1.1. We let Tq be the set of the S largest 
entries of /3, and write P = P + h as before. If z obeys the orthogonality 
condition (1.5), P is feasible and 

ii/?Toik - iiHii^i + ii^Hk - wptsIw < \\p+h\w < mw, 

which gives 

II^TgHk < II^ToIki +2||/3ToHk- 

The presence of the extra term is the only difference in the argument. We 
then conclude from Lemma 3.1 and (3.3) that 

The second part of Lemma 3.1 gives WhWi^ < 2||/i||^2(Toi) + ll/^Tg^Iki " S~^^'^. 
Since for /? obeying the decay condition (1.16), H/Stq'^II^i • S~^^'^ < C ■ R- , 
with r = l/s — 1/2, we have established that, for all S < S^,, 

\\hTs\U,<r——^. (Xp-S^' + R-S-^). 

The theorem follows. 

3.4. Proof of Theorem 1.2. We begin with an auxiliary lemma. 

Lemma 3.2. For any vector /3, we have 

\\xp\u,<viT~6m\i. + {^s)-'/'mu,). 

Proof. Let Ti be the 25" largest positions of /?, then T2 be the next 
largest, and so forth. Then 

\\xp\u,<\\xpTA\i2 + J2\\^f^TM- 

From restricted isometry, we have 

\\xpTAi2 < {i + dy/'\\PTA\i2 < {i + 6)'/'mu. 
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and 

WxPtM < {i + ^Y'^WtM < {i + s)'/\2sr'/'\\PT,.Ak- 

The claim follows. □ 

We now turn to the proof of Theorem 1.2. As usual, we let (5 be the ii 
minimizer subject to the constraints 

(3.11) \\X*{XP-y)\U^ = sup \{X$-y,X^)\<{l + t~')X, 

^<j<p 



where A := \/21ogp for short. 

Without loss of generality, we may order the /?j's in decreasing order of 
magnitude 

(3.12) |/3i|> 1/321 >--->|/3p|. 

In particular, by the sparsity assumption on f3, we know that 

(3.13) pj = for all j> 5. 
In particular, we see that 

^min(/3|,A2) <5- A^. 

j 

Let Sq be the smallest integer such that 

(3.14) 5]min(/3|,A2)<5o-A2; 

j 

thus, < 5o < 5 and 

(3.15) S'o-A2< A2 + ^min(/5|,A2). 

j 

Also, observe from (3.12) that 

So+l 

So-X^>J2 ^ (^0 + 1) min(/3|+i, A2) 

and, hence, min(/3^^^-^, A^) is strictly less than A^. By (3.12), we conclude 
that 

(3.16) (3j<X foranj>S'o. 
Write /3 = where 

/3f =/?i-li<i<So, 
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Thus, is a hard-thresholded version of P, localized to the set 

To :={!,..., 5o}. 
By (3.16), is 5-sparse with 

j>So 

As we shall see in the next section, Corollary A. 3 allows the decomposition 
/3(2) =/3' + /3", where 

(3.17) ll/^'II^^^T37^^-^o^'' 

(3.18) ll/3'll^i<Y37^^-^o 
and 

(3.19) < /_"/_' g A. 

We use this decomposition and observe that 

X*(X(/?(i) + /?') -y) = -X*XP" - X*z 
and, hence, by (3.1) and (3.19), 

(3.20) ||X*(X(/3« +/3') - y)\\,^ < (l + 3^7^)^- 

By assumption, (1 — 5 — 9)~^ < and, therefore, + (3' is feasible, which 
in turn implies 

Put /3 = /3« + Then > - + ||/i||,,(tc) so that 

(3.21) ll^ll^i(T-) < ll^ll^i(To) + YZJZ^^o ■ ^' 
and from (3.11) and (3.20), we conclude that 

(3.22) \\X*X{P' -h)\U^<2(l + -^-^)X. 



1-6-9, 

Figure 1(b) schematically illustrates both these constraints. 
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The rest of the proof is essentially the same as that of Theorem 1.1. By 
Lemma 3.1, we have 

ll^Ollk. < J^\\Xl,Xh\U, + I ,/J h\U^iTSy 

[L — OJOq 

On the other hand, from (3.22), we have 

WXl^Xif - h)\\e, < 2^(^1 + Y^^) S'J' • A, 
while from Lemma 3.2 and (3.18), (3.17), we have 

and, hence, by restricted isometry, 

\\Xl,xne.<il + l/V2)^^^S'J' -X. 

In short, 

\\xl^xh\u,<Co-sl/'-x, 

where Co was defined in (1.14). We conclude that 



Finally, the bound (3.21) gives 



and, hence. 



where 



T'0i\\i2 - ^0 ■ '^'o^^ • A, 



^/ ^ Co 0{l + S) 

°' 1-6-9 (1-6-9)'^' 

Applying the second part of Lemma 3.1 and (3.21), we conclude 

11^11^2 — 2||/ioilk2 + — ^ — 9^^^^ • A < C2 • ^Q^^ • A 
and the claim follows from (3.15). 
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3.5. Refinements. The constant C2 obtained by this argument is not 
best possible; it is possible to lower it further, but at the cost of making 
the arguments slightly more complicated. For instance, in Lemma 3.2 one 
can exploit the approximate orthogonality between X^Ti and the X^Tj 's 
to improve over the triangle inequality. Also, instead of defining Ti,T2,... 
to have cardinality S, one can instead choose these sets to have cardi- 
nality pS for some parameter p to optimize in later. We will not pur- 
sue these refinements here. However, we observe that in the limiting case 
6 = 6 = 0, then X is an orthogonal matrix, and as we have seen earlier, 
Pj = max(|(X*y)j| — A,0) sgn{{X*y)j). In this case one easily verifies that 



<^min(/32,4A2 



This would correspond, roughly speaking, to a value of C2 = 2 in Theorem 
1.2, and therefore shows that there is room to improve C2 by a factor of 
roughly 8. 

4. Numerical experiments. This section presents numerical experiments 
to illustrate the Dantzig selector and gives some insights about the numerical 
method for solving (1.9). 

4.1. An illustrative example. In this first example, the design matrix X 
has n = 72 rows and p = 256 columns, with independent Gaussian entries 
(and then normalized so that the columns have unit norm). We then select 
P with S := \{i : Pi ^ 0}\ = 8 and form y = X(5 + z, where the Zis are i.i.d. 
A^(0,(T^). The noise level is adjusted so that 




Here and below, the regularizing parameter Ap in {DS) is chosen via Monte 
Carlo simulations, that is, as the empirical maximum of over several 

realizations of 2; ~ N{0,ln). The results are presented in Figure 2. 

First, we note that in this example our procedure correctly identifies all 
the nonzero components of and correctly sets to zero all the others. Quan- 
titatively speaking, the ratio between the squared error loss and the ideal 
squared error (1.11) is equal to 

(4.1) p' := = 10.28. 

Eimm(/3/,(T^) 

(Note that here 21ogp = 11.09.) Second, and as essentially observed earlier, 
the method clearly exhibits a soft-thresholding type of behavior and as a 
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Fig. 2. Estimation from y = Xp -\- z with X a 72 by 256 matrix with independent Gaus- 
sian entries. A blue star indicates the true value of the parameter and a red circle the 
estimate. In this example, a = 0.11 and X = 3.5 so that the threshold is at 5 = X - a — 0.39. 
(a) Dantzig selector (1.7). Note that our procedure correctly identifies all the nonzero com- 
ponents of P, and correctly sets to zero all the others. Observe the soft-thresholding-like 
behavior, (b) Estimation based on the two-stage strategy (1.20). The signal and estimator 
are very sparse, which is why there is a solid red line at zero. 



result, tends to underestimate the true value of the nonzero parameters. 
However, the two-stage Dantzig selector (1.20) introduced in Section 1.6 
corrects for this bias. When applied to the same dataset, it recenters the es- 
timator, and yields an improved squared error since now = 1.14; compare 
the results of Figures 2(a) and 2(b). 

In our practical experience, the two-stage or Gauss-Dantzig selector pro- 
cedure tends to outperform our original proposal, and to study its typical 
quantitative performance, we performed a series of experiments designed as 
follows: 

1. X is a 72 by 256 matrix, sampled as before {X is fixed throughout); 

2. select a support set T of size \T\ = S uniformly at random, and sample 
a vector /3 on T with independent and identically distributed entries 
according to the model 

(3i = ei{l + |ai|), 

where the sign si = ±1 with probability 1/2, and Oj ~ A^(0, 1) (the moduli 
and the signs of the amplitudes are independent); 

3. make y = X(3 + z, with z ~ A^(0, cr'^In) and compute (3 by means of the 
two-stage procedure (1.20); 

4. repeat 500 times for each S, and for different noise levels a. 

The results are presented in Figure 3 and show that our approach works 
well. With the squared ratio as in (4.1), the median and the mean of 



THE DANTZIG SELECTOR 25 




100 120 



{a) (I.) 



Fig. 3. Statistics of the ratio between the squared error '^^{Pi — Pif and the ideal 
mean-squared error mm(/3j^, cr^) . (a) 5 = 8, cr = l/S-^/S/n = 0.11, and the threshold 
is here X ■ a = 0.5814. (b) 5 = 8, a = ^ S/n = 0.33, and the threshold is now X ■ a = 1.73. 



are 2.35 and 9.42, respectively, for a noise level a set at l/3-\/ S/n. In 
addition, 75% of the time is less than 10. With a = y^S/n, the mean and 
median are 12.38 and 13.78, respectively, with a bell-shaped distribution. 

The reader may wonder why the two histograms in Figure 3 look some- 
what different. The answer is simply due to the fact that in Figure 3(a) 
a = 0.11 and the threshold is about X ■ a = 0.5814, which means that the 
nonzero /3i's are above the noise level. In Figure 3(b), however, a = 0.33, 
and the threshold is A • c = 1.73. This means that a fraction of the nonzero 
components of (3 are within the noise level and will be set to zero. This 
explains the observed difference in the quantitative behavior. 

4.2. Binary design matrices. We now consider the case where the design 
matrix has i.i.d. entries taking on values ±1, each with probability 1/2 (the 
entries are then divided by ^/n so that the columns have unit norm). This 
simply amounts to measuring differences among randomly selected subsets 
of coordinates. Note that if X had 0/1 entries, one would measure the aggre- 
gate effect of randomly selected subsets of coordinates, much like in pooling 
design schemes. The number of predictors is set to p = 5,000 and the num- 
ber of observations to n = 1,000. Of interest here is the estimation accuracy 
as the number S of significant parameters increases. The results are pre- 
sented in Table 1. In all these experiments, the nonzero coordinates of the 
parameter vector are sampled as in Section 4.1 and the noise level is ad- 
justed to cr = l/3-\/5/n, so that with y = XP + z, the variance E||z|p = na'^ 
is proportional to E||/3|p with the same constant of proportionality (fixed 
signal-to- noise ratio) . 
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Table 1 

Ratio between the squared error — A)^ o,nd the ideal mean squared error 

mm(/3|, (7^) . The binary matrix X is the same in all these experiments, and the noise 

level a = l/Sy^ S/n is adjusted so that the signal-to-noise ratio is nearly constant. Both 
estimators and especially the Gauss-Dantzig selector exhibit a remarkable performance 
until a breakdown point around S = 200 
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20 


50 


100 


150 


200 


Dantzig selector 


22.81 


17.30 


28.85 


18.49 


25.71 


49.73 


74.93 


Gauss-Dantzig selector 


0.36 


0.65 


1.04 


1.09 


1.53 


13.71 


48.74 



Our estimation procedures and most notably the Gauss-Dantzig selector 
are remarkably accurate as long as the number of significant parameters 
is not too large, here about S = 200. For example, for S = 100, the ratio 
between the Gauss-Dantzig selector's squared error loss and the ideal mean- 
squared error is only 1.53. Figure 4 illustrates the estimation precision in 
this case. 

To confirm these findings, we now sample the amplitudes of the parameter 
vector (3 according to a Cauchy distribution in order to have a wide range 
of component values some of which are within the noise level, while 
others are way above; X is fixed and we now vary the number S of nonzero 
components of (3 as before, while a = 0.5 is now held constant. The results 
are presented in Table 2. Again, the Gauss-Dantzig selector performs well. 







• o 

g ^ - 
s f # u ■ s e 
s . 1 n S. 


o S 






\ » 




^ B ' ? ft ' 






a e 




a 


t • 












. li . 



500 1 000 1 500 2000 2500 MOO 3M0 JOOO A500 5000 
(a) 




50 100 150 200 250 300 350 400 450 600 
ih) 



Fig. 4. Estimation from y = XjS -\- z with X a 1,000 by 5,000 matrix with independent 
binary entries. A blue star indicates the true value of the parameter and a red circle 
the estimate, (a) True parameter values and estimates obtained via the Dantzig selector 
(1.20). There are S = 100 significant parameters, (b) Same plot but showing the first 500 
coordinates for higher visibility. 
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Table 2 

Ratio between the squared error 5^^(/3i — ft)^ o,i^d, the ideal mean squared error 
y^^- mm{l3f ,a^). The binary matrix X , a — 0.5 and A ■ ct = 2.09 are the same in all these 

experiments 
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Gauss-Dantzig selector 


3.70 


4.52 


2.78 


3.52 


4.09 


6.56 


5.11 



4.3. Examples in signal processing. We are interested in recovering a 
one-dimensional signal / € from noisy and undersampled Fourier coeffi- 
cients of the form 

yj = {f,(l>j) + i<i<n, 

where (pj{t), t = 0,. . . ,p—l, is a sinusoidal waveform (j)j{t) = ■\/2/ncos(7r(fcj + 
l/2)(i + 1/2)), kj G {0, 1, . . . ,p — 1}. Consider the signal / in Figure 5; / is 
not sparse, but its wavelet coefficients sequence /3 is. Consequently, we may 
just as well estimate its coefficients in a nice wavelet basis. Letting $ be 
the matrix with the (pk's as rows, and W be the orthogonal wavelet matrix 
with wavelets as columns, we have y = Xf3 + z, where X = , and our 
estimation procedure applies as is. 

The test signal is of size p = 4,096 (Figure 5), and we sample a set of 
frequencies of size n = 512 by extracting the lowest 128 frequencies and ran- 
domly selecting the others. With this set of observations, the goal is to study 
the quantitative behavior of the Gauss-Dantzig selector procedure for var- 
ious noise levels. [Owing to the factor in the definition of 4'jit)j 
columns of X have size about one and for each column, individual thresholds 




(fi) (b) 

Fig. 5. (a) One- dimensional signal f we wish to reconstruct, (b) First 512 wavelet co- 
efficients of f. 
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Table 3 

Performance of the Gauss-Dantzig procedure in estimating a signal from undersampled 
and noisy Fourier coefficients. The subset of variables is here estimated by \f3i \ > a /i, 
with as in (1.7). The top row is the value of the signal-to-noise ratio (SNR) 



SNR a = \\X(3\\/Vn(T^ 
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15.51 


2.08 


1.40 


1.47 


0.91 


1.00 



Xi — |(X*r)j| < Aj • a — are determined by looking at the empirical distribu- 
tion of \iX*z)i\.] We adjust a so that = ||X/?|||^/E||z||| = \\Xp\\jJna^ 
for various levels of the signal-to-noise ratio a. We use Daubechies' wavelets 
with four vanishing moments for the reconstruction. The results are pre- 
sented in Table 3. As one can see, high statistical accuracy holds over a 
wide range of noise levels. Interestingly, the estimator is less accurate when 
the noise level is very small (a = 100), which is not surprising, since in this 
case there are 178 wavelet coefficients exceeding a in absolute value. 

In our last example, we consider the problem of reconstructing an image 
from undersampled Fourier coefficients. Here f3(ti,t2), <ti,t2 < N, is an 
unknown by image so that p is the number of unknown pixels, p = N"^. 
As usual, the data is given by y = XjS -\- z, where 

(4.2) (A/3)fc = ^/3(ti,t2)cos(27r(feiii + M2)/Af), k = {kiM). 

or {XP)k = X]ti,t2/3(^i)*2)sin(27r(A;iti + k2t2)/N). In our example [see Fig- 
ure 6(b)], the image (3 is not sparse, but the gradient is. Therefore, to re- 
construct the image, we apply our estimation strategy and minimize the 
^i-norm of the gradient size, also known as the total-variation of /3, 

(4.3) min||/3||TV subject to \{X*r)i\ < Xi ■ a 

(the individual thresholds again depend on the column sizes as before); for- 
mally, the total-variation norm is of the form 

II/3||tv = E \j\DihtiM)\'' + \D2ktiM)\\ 
tiM 

where Di is the finite difference DiP = f3{ti,t2) — (3{ti — 1,^2) and D2P = 
/5(*i)*2) — /3(ti,i2 — 1); in short, ||/3||bv is the ^i-norm of the size of the 
gradient Df5 = {Di(3, D2f3); see also [34]. 

Our example follows the data acquisition patterns of many real imaging 
devices which can collect high-resolution samples along radial lines at rel- 
atively few angles. Figure 6(a) illustrates a typical case where one gathers 
N = 256 samples along each of 22 radial lines. In a first experiment then, 
we observe 22 x 256 noisy real-valued Fourier coefficients and use (4.3) for 
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Fig. 6. (a) Sampling "domain'^ in the frequency plane; Fourier coefficients are sampled 
along 22 approximately radial lines; here, n « 0.086p. (b) The Logan-Shepp phantom test 
image, (c) Minimum energy reconstruction obtained by setting unobserved Fourier coeffi- 
cients to zero, (d) Reconstruction obtained by minimizing the total-variation, as in (4.3). 



the recovery problem illustrated in Figure 6. The number of observations is 
then n = 5,632, whereas there are p = 65,536 observations. In other words, 
about 91.5% of the 2D Fourier coefficients of [3 are missing. The SNR in 
this experiment is equal to ||X/?||£2/||z||£2 =5.85. Figure 6(c) shows the re- 
construction obtained by setting the unobserved Fourier coefficients to zero, 
while (d) shows the reconstruction (4.3). We follow up with a second exper- 
iment where the unknown image is now 512 by 512 so that p = 262,144 and 
n = 22 X 512 = 11,264. The fraction of missing Fourier coefficients is now 
approaching 96%. The SNR ratio is about the same, ||X/3||£2/lklk2 — 5.77. 
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Fig. 6. Continued, (e) Magnitude of the true Fourier coefficients along a radial line 
(frequency increases from left to right) on a logarithmic scale. Blue stars indicate values 
o/log(l + |(X/3)fc|), while the solid red line indicates the noise level log(l + a). Less than 
a third of the frequency samples exceed the noise level, (f) X* z and jS are plotted along a 
scanline to convey a sense of the noise level. 



The reconstructions are of very good quality, especially when compared to 
the naive reconstruction which minimizes the energy of the reconstruction 
subject to matching the observed data. Figure 7 also shows the middle hori- 
zontal scanline of the phantom. As expected, we note a slight loss of contrast 
due to the nature of the estimator which here operates by "soft-thresholding" 
the gradient. There are, of course, ways of correcting for this bias, but such 
issues are beyond the scope of this paper. 

4.4. Implementation. In all the experiments above, we used a primal- 
dual interior point algorithm for solving the linear program (1.9). We used 
a specific implementation which we outline, as this gives some insight about 
the computational workload of our method. For a general linear program 
with inequality constraints 

mine*/? subject to Fj3<b, 

define 

. /(/?)=X/3-6, 

• rduai = c + X*A, 

• ''cent = -diag(A)/(/3) - 

where A G R™ are the so-called dual variables, and t is a parameter whose 
value typically increases geometrically at each iteration; there are as many 
dual variables as inequality constraints. In a standard primal-dual method 
(with logarithmic barrier function) [8], one updates the current primal-dual 
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pair (/3,A) by means of a Newton step, and solves 

-diag(A)F -diag(/(/3))y' Ucent 7 ' 

that is, 

AA = - diag(l//(/3))(diag(A)FA^ - w), 

-[F* diag(A//(/3))F]A/3 = -(rduai + F* diag(l//(/3)) w). 

The current guess is then updated via , A+) = {(3, A) + s{Af3, As), where 
the stepsize s is determined by hne search or otherwise. Typicahy, the se- 
quence of iterations stops once the primal-dual gap and the size of the resid- 
ual vector fall below a specified tolerance level [8]. 
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Letting U = X*X and y ■■ 
block structure 



F 



which gives 

F*diag(A//)F: 
where A = diag(Ai//i), 



X*y in (1.9), our problem parameters have the 
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Put 



^'duai + F* diag(l//(/3))rccnt- It is convenient to solve the system 
F*diag(A//)F 



A/3 
Au 



D2)r2, 



by elimination and obtain 

{4{Di + D2)-^DiD2 + U*iD3 + Di)U)Ap = n - {Di + L'a)"^^! 

{Di + D2)Au = r2- {D2 - Di)Ap. 

In other words, each step may involve solving a p hy p system of linear 
equations. In fact, when n is less than p, it is possible to solve a smaller 
system thanks to the Sherman-Woodbury-Morrison formula. Indeed, write 
U*{D3 + Di)U = X*B, where B = X{Ds + Di)X*X and put Du = 4(L»i + 
D2)~^DiD2. Then 



{D12 + X'^B)-^ = D^^ - D^^X*{I + BD^^X*y^BD 



12 ■ 



The advantage is that one needs to solve the smaller n by n system (/ + 
BD^2-^*)P' — b' ■ Hence, the cost of each Newton iteration is essentially 
that of solving an n by n system of linear equations, plus that of forming 
the matrix (/ + BDY2X*). As far as the number of Newton iterations is 
concerned, we ran thousands of experiments and have never needed more 
than 45 Newton iterations for convergence. 

Note that in some important applications, we may have fast algorithms for 
applying X and X* to an arbitrary vector, as in the situation where X is a 
partial Fourier matrix, since one can make use of FFT's. In such settings, one 
never forms X*X and uses iterative algorithms such as Conjugate Gradients 
for solving such linear systems; of course, this speeds up the computations. 

Finally, a collection of MATLAB routines solving (1.9) for reproducing 
some of these experiments and testing these ideas in other setups is available 
at the address www.ll-magic.org/. 
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5. Discussion. 

5.1. Significance for model selection. We would like to briefly discuss 
the implications of this work for model selection. Given a data matrix X 
(with unit-normed columns) and observations y = Xf3 + z, our procedure 
will estimate [5 by that vector with minimum ^i-norm among all objects (3 
obeying 



dinates f5i will typically be zero (at least under the assumption that the 
true unknown vector (3 is sparse) and, therefore, our estimation procedure 
effectively returns a candidate model / := {i : /3j 7^ 0}. 

As we have seen, the Dantzig selector is guaranteed to produce optimal 
results if 



[note that since 0{X)s,2S 1^ ^{X)3S^ it would be sufficient to have 6{X)2s + 
d{X)^s < !]• We have commented on the interpretation of this condition 
already. In a typical model selection problem, X is given; it is then natural 
to ask if this particular X obeys (5.2) for the assumed level S of sparsity. 
Unfortunately, obtaining an answer to this question might be computation- 
ally prohibitive, as it may require checking the extremal singular values of 
exponentially many submatrices. 

While this may represent a limitation, two observations are in order. First, 
there are empirical evidence and theoretical analysis suggesting approximate 
answers for certain types of random matrices, and there is nowadays a sig- 
nificant amount of activity developing tools to address these questions [17]. 
Second, the failure of (5.2) to hold is in general indicative of a structural 
difficulty of the problem, so that any procedure is likely to be unsuccessful 
for sparsity levels in the range of the critical one. To illustrate this point, 
let us consider the following example. Suppose that 62s + ^5,25 > 1- Then 
this says that 828 is large and since 6s is increasing in 5, it may very well 
be that for S' in the range of S, for example, S' = 3S, there might be 
submatrices (in fact, possibly many submatrices) with S' columns which 
are either rank-deficient or which have very small singular values. In other 
words, the significant entries of the parameter vector might be arranged in 
such a way so that even if one knew their location, one would not be able 
to estimate their values because of rank-deficiency. This informally suggests 
that the Dantzig selector breaks down near the point where any estimation 
procedure, no matter how intractable, would fail. 

Finally, it is worth emphasizing the connections between RIC [26] and 
(1.7). Both methods suggest a penalization which is proportional to the 




(5.2) 



6{X)2S+e{X)s,2S<l 
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logarithm of the number of explanatory variables — a penalty that is well 
justified theoretically [7]. For example, in the very special case where X is 
orthonormal, p = n, RIC applies a hard-thresholding rule to the vector X*y 
at about 0{y/2\ogp), while our procedure translates in a soft-thresholding 
at about the same level; in our convex formulation (1.7), this threshold level 
is required to make sure that the true vector is feasible. In addition, the 
ideas developed in this paper have broad applicability, and it is likely that 
they might be deployed and give similar bounds for RIC variable selection. 
Despite such possible similarities, our method differs substantially from RIC 
in terms of computational effort since (1.7) is tractable while RIC is not. In 
fact, we are not aware of any work in the model selection literature which is 
close in intent and in the results. 

5.2. Connections with other work. In [10] a related problem is studied 
where the goal is to recover a vector /3 from incomplete and contaminated 
measurements y = X[3 + e, where the error vector (stochastic or determin- 
istic) obeys HeHl^ < D. There, the proposed reconstruction searches, among 
all objects consistent with the data y, for that with minimum ^i-norm, 



Under essentially the same conditions as those of Theorems 1.1 and 1.2, 
[10] showed that the reconstruction error is bounded by 



In addition, it is also argued that, for arbitrary errors, the bound (5.4) is 
sharp and that, in general, one cannot hope for a better accuracy. 

If one assumes that the error is stochastic as in this paper, however, a 
mean squared error of about D might be far from optimal. Indeed, with 
the linear model (1.1), D ~ cr'^Xn and, therefore, D has size about no"^. But 
suppose now [3 is sparse and has only three nonzero coordinates, say, all 
exceeding the noise level. Then whereas Theorem 1.1 gives a loss of about 
3(7^ (up to a log factor), (5.4) only guarantees an error of size about na"^ . 
What is missing in [10] and is achieved here is the adaptivity to the unknown 
level of sparsity of the object we try to recover. Note that we do not claim 
that the program {P2) is ill-suited for adaptivity. It is possible that refined 
arguments would yield estimators based on quadratically constrained ^i- 
minimization [variations of (5.3)] obeying the special adaptivity properties 
discussed in this paper. 

Last but not least, and while working on this manuscript, we became 
aware of related work [29]. Motivated by recent results [11, 14, 18], the au- 
thors studied the problem of reconstructing a signal from noisy random pro- 
jections and obtained powerful quantitative estimates which resemble ours. 



(5.3) 



min \\(3\\i,^ subject to ||y - X(3\\\ < D. 



(5.4) 




THE DANTZIG SELECTOR 



35 



Their setup is different tliougli, since tliey exclusively work with random 
design matrices, in fact, random Rademacher projections, and do not study 
the case of fixed X's. In contrast, our model for X is deterministic and does 
not involve any kind of randomization, although our results can of course 
be specialized to random matrices. But more importantly, and perhaps this 
is the main difference, their estimation procedure requires solving a combi- 
natorial problem much like (1.15), whereas we use linear programming. 



This appendix justifies the construction of a pseudo-hard thresholded vec- 
tor which obeys the constraints; see (3.17), (3.18) and (3.19) in the proof of 
Theorem 1.2. 

Lemma A.l (Dual sparse reconstruction, £2 version). Let S s.t. 5 + 6 < 

1, and let ct be supported on T for some \T\ < 2S. Then there exist (3 
supported on T , and an exceptional set E disjoint from T with 



APPENDIX 



(6.1) 



E\ <S, 



such that 



(6.2) 




for all j & T 



and 



(6.3) 



\{Xf3,X^)\< 



9 



for all J ^ (T U E) 



{i-s)Vs 



and 



(6.4) 




Also we have 



(6.5) 




and 



(6.6) 




Proof. We define P by 
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and zero outside of T, which gives (6.2), (6.5) and (6.6) by Cauchy-Schwarz. 
Note that X[5 = XtPt- We then set 

E := |j (^T:\{Xf3,X^)\ > -_^-^||cT|k,}, 

so (6.3) holds. 

Now if T' is disjoint from T with \T'\ < S and dx' is supported on T' , 
then 

\{X'f,XTPT,dT')\ < 9\\PT\\i2\\(iT'\\£2 

and, hence, by duahty, 

(6.7) \\X*XTPT\\i2iT') < OIWtIU^ < Y^llcTllfe- 

If \E\ > S, then we can find T' C E with \T'\ = S. But then we have 



\X*XTpT\\e2{T') > (l_^)5l/2 ll^^ll^2 



ji/|l/2 



which contradicts (6.7). Thus, we have (6.1). Now we can apply (6.7) with 
T' to obtain (6.4). □ 

Corollary A. 2 (Dual sparse reconstruction, ^oo version). Let ct he 
supported on T for some \T\ < S. Then there exists [5 obeying (6.2) such 
that 

(6.8) \{X(3,Xi)\ < \ \\cT\\e, for all j i T. 

[L — — o)\/ b 

Furthermore, we have 
and 

(6.10) ||/3|k<Y^|^||cT|k. 

Proof. The proof of this lemma operates by iterating the preceding 
lemma as in Lemma 2.2 of [13]. We simply rehearse the main ingredients 
and refer the reader to [13] for details. 

We may normalize X^jeT kjP = 1- Write Tq : = T and note that |To| < 25". 
Using Lemma A.l, we can find a vector [3^^^ and a set Ti C {1, . . . ,p} such 



that 
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TonTi=0, 
\Ti\<S, 

(X/3(^) , X^) = Cj for all i G Tq, 
I {XpW , I < for all i ^ To U Ti , 



J2\{xp^'\x^)\ 



\ 1/2 

2 1 



< 



1-6' 
1 

'25 



ii/3«ik<r_,- 

Applying Lemma A.l iteratively gives a sequence of vectors £ and 

sets T„+i C {1, . . . ,p} for all n > 1 with the properties 

r„+in(rour„) = 0, 

l^n+ll < S, 

,X^) = , X^') for all j G r„, 

, X^ ) = for ah j G Tq, 



I , X^)\ < j^^^ [y^^] Vj ^ To U U T.+i , 



l-<5 Vl-<5 



("+l)|L < 



By hypothesis, we have < 1. Thus, if we set 

n=l 

then the series is absolutely convergent and, therefore, /3 is a well-defined 
vector. And it turns out that /? obeys the desired properties; see Lemma 2.2 
in [13]. □ 
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Corollary A. 3 (Constrained thresholding). Let (3 be S-sparse such 
that 

m\i,<x-s'/' 

for some A > 0. Then there exists a decomposition (3 = (3' + (3" such that 

\\0%.< 



and 



1-6- 

^ 1+'^ ml 



Proof. Let 

T:={j:\{X(3,X^)\>{l + 6)X}. 

Suppose that \T\ > S. Then we can find a subset T' of T with \T'\ = S. Then 
by restricted isometry we have 

1^2' 



(1 + 6fX'S < ^ \{XP,X^)\' < (1 + 6)\\XP\\l < (1 + 5)2" 



jeT' 

contradicting the hypothesis. Thus, \T\ < S. Applying restricted isometry 
again, we conclude 

{l + 5fX^\T\ < \{XP,X^)\^ < (l + <5)2ii«ii2 

and, hence, 

in < 5:= 



A2 ■ 

Applying Corollary A. 2 with cj := {XP,X^), we can find a /?' such that 
{XP' ,X^) = {XP, X^ ) for all j G T, 

{1 + 6)VS..^.. (1 + 6) '""'2 



< 



'"^ - i-s-e ' i-s-e X 

and 
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By definition of S, we tlius have 

\{X(3',X^)\ < ^_1_q {1 + S)X for all j i T. 

Meanwhile, by definition of T, we have 

I {X(3, X^') I < (1 + 6)\ for all j ^ T. 
Setting 13" :=(3- f3\ the claims follow. □ 
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